Abstract. This paper presents our work on development of OWL-driven
systems for formal representation and reasoning about terminological
knowledge and facts in petrology. The long-term aim of our project is to
provide solid foundations for a large-scale integration of various kinds of
knowledge, including basic terms, rock classification algorithms, findings
and reports. We describe three steps we have taken towards that goal
here. First, we develop a semi-automated procedure for transforming a
database of igneous rock samples to texts in a controlled natural language
(CNL), and then a collection of OWL ontologies. Second, we create an
OWL ontology of important petrology terms currently described in
natural language thesauri. We describe a prototype of a tool for collecting
definitions from domain experts. Third, we present an approach to
formalization of current industrial standards for classification of rock
samples, which requires linear equations in OWL 2. In conclusion, we discuss
a range of opportunities arising from the use of semantic technologies in
petrology and outline the future work in this area.

1 Introduction

Petrology, a branch of geology studying rocks and their formation, plays an
important role in describing the structure of the Earth's crust, which is
essential for revealing patterns in the distribution of mineral resources. As
in other natural sciences, a wealth of knowledge has been accumulated in
petrology, and it requires proper management (especially with regard to
consistency) and integration. These tasks could be approached more
efficiently if the knowledge were machine processable, in particular, if a
formal theory of petrology (i.e. a system of axioms, definitions and
theorems [11], p. 33) were available. Ontologies, especially OWL ontologies,
are well suited to serve as a cornerstone of such a theory, as they have been
remarkably successful in other sciences, e.g., bioinformatics, chemistry, and
health care.

This paper describes our steps towards developing a formal theory of
petrology. We focus on identifying basic terms, providing definitions for
other commonly used terms (i.e., terms used in industrial standards, namely
rock types such as rhyolite or harzburgite), and formalizing the basic set of
axioms. We use OWL as the main formalization tool, which enables us, in
particular, to automatically check our representation for consistency.

It is only natural to start developing a theory by identifying the important
terms to be later used for representing facts, e.g., knowledge about specific
rock samples. In modern petrology such facts are typically stored in
relational databases, so relational databases can be used as a source of
terms. In Section 2 we describe the conversion of one such database, namely
Proba [5] (Sample in Russian), to a collection of OWL ontologies. These
ontologies contain facts expressed using an initial set of currently
undefined terms.

Once the terms have been identified, we proceed to their formalization, i.e.,
writing their definitions in OWL. First, it is essential to define the basic
terms, which can be used to define all other terms. Currently available
definitions are usually stored in a semi-structured form in natural language
thesauri. Among other issues, this often leads to contradictions, especially
given the differences between schools in petrology. We use one such
thesaurus, namely the Glossary of Igneous Rocks [7], to define petrological
terms and relationships in an OWL ontology. In addition, we develop a
WebProtege-based tool to enable domain experts to work collaboratively on
term definitions, in particular, to agree upon them. See Section 3 for
details.

Finally, we complement the ontology by using another rich source of term
definitions: internationally adopted scientific recommendations describing
rock sample classification methodologies, e.g. Igneous Rocks: A
Classification and Glossary of Terms [10]. Section 4 describes an approach to
extracting definitions from the standard and expressing them as OWL axioms.
As it stands, OWL 2 is insufficient for a complete capture of the term
semantics (as specified in the standard), but this would be possible if path
free linear equations were adopted.^1 We conclude the paper by summarizing
our experience from the described work and outlining plans for the future.



2 Formalizing Facts: From Database to OWL 

A considerable amount of important information is stored in databases, but in
the form of data, which, unfortunately, is not knowledge and requires
substantial and laborious processing to become knowledge. This section
describes a direct way of obtaining knowledge from the data: converting the
database to the traditional form of knowledge, i.e. knowledge in a natural
language. The natural language is limited to a CNL to make this knowledge
machine processable. We follow T. Kuhn: CNLs are subsets of natural languages
that are restricted in a way that allows their automatic translation into
formal logic ([9], p. 5). We consider CNL a universal tool for representing
formal ontological knowledge.

^1 The details of the proposed extensions are available at
http://www.w3.org/TR/owl2-dr-linear/. Our work on petrology may be
viewed as a use case for supporting linear equations in future OWL versions.

The original database. Proba DB [5] contains data from 1,174 scientific
articles (Bibliography table) about 49,285 samples of igneous rocks
(Measurements table). The samples were collected all over the globe, which is
reflected in the Localities, llocal, Iglobal and Igroup tables. Each sample
is assigned a rock type (Rocks table), a genesis type (Errupttypes table), an
age (ages table), and, most importantly, the weight percentages
(Concentrations table) of chemical substances and isotopes (listed in the
Elements table).

This brief description alone already shows that table and column identifiers 
can only approximately match the terms used by petrologists to exchange sample 
data. The transition to CNL also solves the problem of converting the data saved 
in RDB to knowledge in a form directly understandable to experts in the subject 
domain. 

CNL sentences. List 1 gives an example of every type of CNL sentence required
to present all facts contained in the Proba DB. The sentences use local
(internal) proper names to refer to objects within the knowledge base. Thus,
PUB5633 is the name of article number 5633 (from bibliography.id) in the DB,
SAM32994 is the name of sample number 32994 (from measurements.id) in the DB,
etc. Words in compound terms are connected by the underscore character "_".
The text also contains well-known global proper names, for example, Iceland
and Atlantic_Ocean.

List 1. Example of CNL sentences.
PUB5633 is a publication.
A title of PUB5633 is "A CONTRIBUTION TO THE GEOLOGY OF THE K...".
SAM32994 is a sample. SAM32994 is a rhyolite.
PUB5633 describes SAM32994.
PLC32994 is a place. PLC32994 is a part of Iceland.
A gathering_place of SAM32994 is PLC32994.
SUB469812 is a substance. SAM32994 includes SUB469812.
WPC469812 is a weight_percent. A value of WPC469812 is 73.95.
A component of WPC469812 is SUB469812.

The sentence structure is very simple. Only a very limited natural language
is required to record all facts contained in an RDB, provided that the RDB is
normalized. The Proba RDB, however, is not fully normalized. Completing the
normalization is one of the tasks in reorganizing the DB to enable automatic
conversion to knowledge. Rules for mapping the RDB content to CNL have been
developed. These rules serve as the specification for the SQL scripts that
dump the RDB to CNL text [15].
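
The mapping from a database row to CNL sentences can be illustrated with a small sketch. The authors use SQL scripts for this step; the Python function below is only a stand-in for illustration, and its name, its parameters and the flat "one row per measurement" input are assumptions rather than the actual Proba schema. The sentence templates follow List 1.

```python
def cnl_sentences(sample_id, rock_type, pub_id, country, conc_id, value):
    """Render one measurement as CNL sentences modelled on List 1.

    All parameters are hypothetical stand-ins for database columns.
    """
    sam, pub, plc = f"SAM{sample_id}", f"PUB{pub_id}", f"PLC{sample_id}"
    sub, wpc = f"SUB{conc_id}", f"WPC{conc_id}"
    return [
        f"{pub} is a publication.",
        f"{sam} is a sample. {sam} is a {rock_type}.",
        f"{pub} describes {sam}.",
        f"{plc} is a place. {plc} is a part of {country}.",
        f"A gathering_place of {sam} is {plc}.",
        f"{sub} is a substance. {sam} includes {sub}.",
        f"{wpc} is a weight_percent. A value of {wpc} is {value}.",
        f"A component of {wpc} is {sub}.",
    ]


# Usage: reproduce the sample from List 1.
for sentence in cnl_sentences(32994, "rhyolite", 5633, "Iceland", 469812, 73.95):
    print(sentence)
```

Because every sentence is produced from a fixed template, the output stays within the restricted sentence forms that the CNL compiler is expected to accept.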

OWL ontology: generation and analysis. All generated sentences are ACE
language [3] sentences, and are selected so that a concrete compiler, APE,^2
can compile them to OWL. The portion of the knowledge contained in each
article is separated as a text (ACE file) to be converted to an independent
ontology (its DL expressivity is AL(D)). Thus, the DB will be converted to
1,174 ontologies. Column values mostly become attribute values, but some
become class names (rhyolite, harzburgite) and individual names (Iceland).
Let us consider the ontology obtained for the article with DB number 5633.
The obtained classes, properties and individuals are listed below.

^2 Attempto Parsing Engine, http://attempto.ifi.uzh.ch/site/tools/

Classes: place, publication, rhyolite, sample, substance, weight_percent.
Object properties: component, describes, gathering_place, includes, mixture, part.
Data properties: authorial_number, chemical_formula, first_page, journal_reference,
last_page, latitude, longitude, reference, title, value, year.
Individuals: Atlantic_Ocean, Iceland etc.

All the terms used, except rhyolite, refer to contexts outside of petrology
and even geology. These are the contexts of geography (place, etc.),
scientific publications (publication, etc.), solid state physics (sample,
substance, weight_percent, etc.) and chemistry (chemical_formula). The rest
of the paper focuses on obtaining rock type definitions, including that for
rhyolite.

3 Formalizing Terminology: From Natural Language to
OWL

The fact ontologies show that some of the names used for classes, relations
and individuals belong to a different ontology (a vocabulary). This
dictionary ontology is supposed to provide term definitions, and it is
exactly these definitions that the author of an article has in mind. Such
scientific terms are normally already collected in dictionaries, for example,
the Petrographic Dictionary [12], the Dictionary of Geological Terms, the
Dictionary of Igneous Rocks Terms [7] and the Glossary of Geology [1]. A
dictionary represents a very important and specific type of knowledge. It is
based on subject domain terms and informal definitions of these terms.
Example: the harzburgite rock type article from [10], p. 88:

HARZBURGITE. An ultramafic plutonic rock composed essentially of 
olivine and orthopyroxene. Now defined modally in the ultramafic rock 
classification (Fig. 2.9, p. 28). (Rosenbusch, 1887, p. 269; Harzburg, Harz 
Mts, Lower Saxony, Germany; Troeger 732; Johannsen v. 4, p. 438; Tomkeieff 
p.247) 

We have converted a specific dictionary ([7]), initially published by its
authors as an HTML page, to an OWL ontology. We have begun formalizing the
relations between terms (for example, synonymy) and term properties (for
example, being obsolete).

Converting the dictionary text to ontology. We took the Dictionary of
Terms of Igneous Rock Types compiled by the Interdepartmental Petrographic
Committee in the Department of Earth Sciences of the Russian Academy of
Sciences [7]. The dictionary contains 1,567 articles, the overwhelming
majority of them being rock names. The dictionary structure and the
conversion procedures required to obtain the ontology are described in [13];
the most important points are summarized below.
Vocabulary: Words in compound terms are connected by the underscore
character "_".
Article title: In the simple case the title of a dictionary article contains
a Russian term and its English equivalent, but Russian and English synonyms
are often specified as well. Each term present in the title generates an
ontology class. Thus, the ontology contains classes in Russian and in
English. All terms from one title are considered synonyms, i.e. their classes
are declared equivalent. These conversions produced 3,179 classes and 1,659
class equivalence axioms in the ontology.
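
The title rule above (one class per term, all terms of a title declared equivalent) can be sketched as follows. This is a minimal illustration and not the authors' conversion procedure: the function name, the default prefix and the choice of OWL 2 functional-style syntax as plain text output are assumptions, and the input terms in the usage example are placeholders, not real dictionary entries.

```python
def title_to_axioms(terms, prefix="dic"):
    """Turn the terms of one article title into OWL axioms (as text).

    Spaces in compound terms are replaced by "_", one class is declared
    per term, and all classes from the same title are made equivalent.
    """
    names = [f"{prefix}:{t.strip().replace(' ', '_')}" for t in terms]
    axioms = [f"Declaration(Class({n}))" for n in names]
    if len(names) > 1:
        # Synonyms from the same title become one equivalence axiom.
        axioms.append(f"EquivalentClasses({' '.join(names)})")
    return axioms


# Usage with placeholder terms (hypothetical title with two synonyms).
for axiom in title_to_axioms(["some rock", "another name"]):
    print(axiom)
```

A title with a single term yields only a class declaration, which is consistent with the number of equivalence axioms being close to the number of articles rather than the number of classes.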

The text of the article: The basic parts of a dictionary article text are the
term definition, a comment, a list of links to references (normally at the
end), and a description of the term origin (normally located in the list of
references, after the article in which the term was introduced). Comments and
reference lists are to be extracted from the article text and stored in the
ontology as separate annotations.

The dictionary ontology (its DL expressivity is ALUF(D)) is published^3 and
can currently be viewed using any ontology browser.

Collective management of scientific term definitions. Another copy of
the ontology is accessible by means of WebProtege^4 installed on the Geology
portal.^5 The dictionary ontology is named 'dic' there.

It is important that a prefix and a namespace be assigned to each dictionary.
The following prefixes are used, respectively, for terms of the ontology
itself, terms from the Moscow State University Geoweb portal, terms from the
Petrographic Code of Russia [5], and terms from the dictionary [7]:

prefix dic: <//earth.jscc.ru/ontologies/dic.owl#>
prefix gwr: <//wiki.web.ru/wiki#>
prefix pgcc: <//www.igem.ru/site/petrokomitet/code#>
prefix pgc: <//www.igem.ru/site/petrokomitet/slovar#>

A formal definition of term meaning is critical for developing a formal
theory. For example, the current version of the dictionary provides a formal
definition of the abessedite rock type (see Portlet Axioms for
dic:abessedite), namely

peridotite and minerals_mixture and
contains_mineral only (olivin or hornblende or phlogopite)

This formula is written using the Manchester OWL syntax. It is important
that petrologists are able to read it. Obtaining formal (mathematical)
definitions, especially in a form clear to experts, is described further
below and is one of the project's main ultimate goals. The report [13]
contains details of the work done.

^3 http://earth.jscc.ru/ontologies/dic.owl
^4 http://protegewiki.stanford.edu/index.php/WebProtege
^5 http://earth.jscc.ru/webprotege/

4 Formalizing Rock Classification 

Rules of rock type assignment to samples are described in [10] and consist of
a description of an initial classification algorithm and diagrams for the
final classification by percentage of essential minerals. We begin with a
specification of all parts of the algorithm, its input being sample data and
its output being a term (word combination) representing the sample's rock
type. The algorithm is written as a set of functions in the form of
flowcharts clear to petrologists.

The algorithm uses some real-valued functions and unary predicates. These
functions and predicates are supposed to have a value on any solid [?]. Some
of these functions and predicates have already been given definitions;
definitions still have to be found for others; and some will probably remain
without definitions and will enter the formal theory as primary ones. The
algorithm and the necessary definitions are given for ultramafic types of
plutonic rock as an example. We then show how to obtain formal definitions of
some rock types from the algorithm.

VPC means the mineral Volume Percentage Content of the sample and is also
known as "volume modal data".

We call an algorithm function (for example, ultramafic_rock_type) that
receives sample data as input and returns the name of the sample's rock type
a classifying function.

Quantitative and Qualitative Characteristics. We need unary real-valued
functions returning the volume percentage of minerals in a solid. The full
set of minerals required for the algorithm will be clarified gradually.

The following functions of one argument returning a real number have been
required so far: VPC_melilite, VPC_kalsilite, VPC_leucite, VPC_Ol, VPC_Opx,
VPC_Cpx, VPC_hornblende, VPC_garnet, VPC_spinel, and VPC_biotite. These
functions are primary and can be measured.

We also need the VPC of groups of minerals (see [10] p. 4, [6] p. 6): VPC_Q,
VPC_A, VPC_P, VPC_F and VPC_M. These functions clearly have definitions.
The VPC_M definition is given below.

The following unary predicates will be required to describe the sample:
pyroclastic, kimberlite, lamproite, lamprophyre, charnockite, plutonic,
and volcanic. All of these predicates are supposed to have definitions. The
definition of pyroclastic is given below.

Definitions. All the definitions currently available can be found in a
technical report [2]. We show typical examples here. All definitions are
based on two sources, "Igneous Rocks: A Classification and Glossary of Terms"
[10] and "BGS Rock Classification Scheme" [6], and have been confirmed by
petrologists.

VPC_Px: the modal content of pyroxenes (required to classify some plutonic
rocks):

VPC_Px(x) =def VPC_Opx(x) + VPC_Cpx(x)

where =def means "by definition" [16].

VPC_OOC and VPC_OPH: VPC of mineral groups. We need these definitions to
formalize the diagrams in Fig. 2.9, p. 28 of [10].

VPC_OOC(x) =def VPC_Ol(x) + VPC_Opx(x) + VPC_Cpx(x)
VPC_OPH(x) =def VPC_Ol(x) + VPC_Px(x) + VPC_hornblende(x)

VPC_M: returns the volume percentage of group M (mafic) minerals in the
sample (see [10], p. 4, 28, and especially [6], p. 6). Following the direct
instructions given in [6], p. 6:

M = mafic and related minerals, that is all other minerals apart from
QAPF;...

we obtain the definition:

VPC_M(x) =def 100 - (VPC_Q(x) + VPC_A(x) + VPC_P(x) + VPC_F(x))
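
The derived quantities defined above are simple sums of primary measurements, which a short sketch makes concrete. The representation of a sample as a dictionary from mineral (or mineral group) name to volume percentage, and the treatment of a missing entry as 0, are assumptions made only for this illustration.

```python
def vpc(sample, name):
    """Primary measurement: volume percentage of one mineral or group.

    Assumption of this sketch: a mineral absent from the dict counts as 0.
    """
    return sample.get(name, 0)


def vpc_px(s):
    # VPC_Px(x) =def VPC_Opx(x) + VPC_Cpx(x)
    return vpc(s, "Opx") + vpc(s, "Cpx")


def vpc_ooc(s):
    # VPC_OOC(x) =def VPC_Ol(x) + VPC_Opx(x) + VPC_Cpx(x)
    return vpc(s, "Ol") + vpc(s, "Opx") + vpc(s, "Cpx")


def vpc_oph(s):
    # VPC_OPH(x) =def VPC_Ol(x) + VPC_Px(x) + VPC_hornblende(x)
    return vpc(s, "Ol") + vpc_px(s) + vpc(s, "hornblende")


def vpc_m(s):
    # VPC_M(x) =def 100 - (VPC_Q(x) + VPC_A(x) + VPC_P(x) + VPC_F(x))
    return 100 - (vpc(s, "Q") + vpc(s, "A") + vpc(s, "P") + vpc(s, "F"))


# Usage with made-up modal data (not taken from the Proba DB).
sample = {"Ol": 75, "Opx": 20, "Cpx": 3, "hornblende": 2}
print(vpc_px(sample), vpc_ooc(sample), vpc_oph(sample), vpc_m(sample))
```

Keeping the primary measurements separate from the derived functions mirrors the distinction in the text between functions that are measured and functions that have definitions.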

pyroclastic: We mainly rely on Section 2.2, "PYROCLASTIC ROCKS AND TEPHRA",
of [10], p. 7.

pyroclastic(x) =def clastic(x) ∧ (∀y clast(y) ∧ part_of(y,x)
    → volcanic_eruption_result(y))

This can also be represented in DL:

pyroclastic ≡ clastic ⊓ ∀(part_of ∘ id(clast))⁻.volcanic_eruption_result
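
To make the quantifier structure of the pyroclastic definition explicit, the predicate can be evaluated over a toy object model. The `Solid` class, its boolean fields and the `parts` list (holding every y with part_of(y, x)) are assumptions of this sketch; the paper itself works with OWL individuals and a reasoner.

```python
from dataclasses import dataclass, field


@dataclass
class Solid:
    """Hypothetical stand-in for an individual in the knowledge base."""
    clastic: bool = False
    clast: bool = False
    volcanic_eruption_result: bool = False
    parts: list = field(default_factory=list)  # all y with part_of(y, self)


def pyroclastic(x):
    # clastic(x) and, for every part y of x that is a clast,
    # volcanic_eruption_result(y)
    return x.clastic and all(
        y.volcanic_eruption_result for y in x.parts if y.clast
    )


# Usage: a clastic solid whose only clast results from a volcanic eruption.
erupted_clast = Solid(clast=True, volcanic_eruption_result=True)
print(pyroclastic(Solid(clastic=True, parts=[erupted_clast])))
```

Note that, exactly as in the first-order formula, a clastic solid with no recorded clasts satisfies the universal condition vacuously under this closed-world evaluation; an OWL reasoner under the open-world assumption would not draw the same conclusion without further axioms.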

Algorithm. Our algorithm is a further formalization (and elaboration!) of the
classification rules provided in [10]. The algorithm is written as a set of
function flowcharts, the main one being the classifying function rock_type.
This function should be invoked to classify a sample. We have also created
flowcharts for the ultramafic rock classifying function and for the two
diagrams in Fig. 2.9 of [10], p. 28: OOC_diagram_field (the upper triangle)
and OPH_diagram_field (the lower triangle). The IUGS diagram flowcharts are
deliberately presented as a chain of if-nodes, each one being responsible for
one specific diagram area. Each if-condition represents a system of linear
inequalities. The set of such conditions has important mathematical
properties:

- Any two conditions are incompatible, since the areas corresponding to them
  are mutually disjoint.

- The union of all conditions gives the inequalities for a triangle, since
  the conditions cover the entire triangle.

It is important that the described properties can be checked automatically if
the definitions are loaded into a reasoner working with linear inequalities.
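
The two properties (pairwise disjointness and full coverage of the triangle) can be tested empirically on a grid of compositions. The sketch below is only such a sampling test, not a proof: an actual proof would need a reasoner or solver for linear inequalities, as the text says. The four fields are hypothetical and deliberately simplified; they are not the fields of the IUGS diagram.

```python
# Hypothetical fields of a ternary diagram over (ol, opx, cpx),
# with ol + opx + cpx == 100. Each condition is a system of linear
# inequalities, written here as a Python predicate.
FIELDS = {
    "field_A": lambda ol, opx, cpx: ol >= 90,
    "field_B": lambda ol, opx, cpx: 40 <= ol < 90 and cpx < 5,
    "field_C": lambda ol, opx, cpx: 40 <= ol < 90 and cpx >= 5,
    "field_D": lambda ol, opx, cpx: ol < 40,
}


def grid(step=1):
    """Yield all integer-percentage points of the triangle."""
    for ol in range(0, 101, step):
        for opx in range(0, 101 - ol, step):
            yield ol, opx, 100 - ol - opx


def check_partition(fields, points):
    """Return (overlaps, gaps) found among the sampled points."""
    overlaps, gaps = [], []
    for p in points:
        hits = [name for name, cond in fields.items() if cond(*p)]
        if len(hits) > 1:
            overlaps.append((p, hits))  # two conditions are compatible
        elif not hits:
            gaps.append(p)              # the triangle is not covered
    return overlaps, gaps


overlaps, gaps = check_partition(FIELDS, grid())
print(len(overlaps), len(gaps))
```

A non-empty `overlaps` list points to two if-conditions that can fire for the same sample, and a non-empty `gaps` list points to compositions for which the flowchart would return no rock type.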
Rock type predicate definition. The classification algorithm implicitly
contains definitions of all types of igneous rock. The definitions can be
obtained from the algorithm in the form of formulas of first-order predicate
calculus with numbers, each with one free variable. The formula structure
shows the complexity of the concept behind the term, and also specifies all
the concepts underlying the term. This is extremely important for finding the
primary concepts. We have obtained formulas for the harzburgite and dunite
predicates quite formally, i.e. using mathematical transformations.

harzburgite: when applied to a sample, the harzburgite predicate should
give "true" if the sample is harzburgite, and "false" otherwise. To obtain
the predicate, the flowcharts have to be traced from top to bottom,
collecting the conditions that lead to the OOC_diagram_field flowchart node
producing the "harzburgite" value. These conditions should be connected by
the logical operation "and". The transformations give the following formula:

harzburgite(x) =def plutonic(x) ∧ ¬(pyroclastic(x) ∨ kimberlite(x)
    ∨ lamproite(x) ∨ lamprophyre(x) ∨ charnockite(x))
    ∧ VPC_carbonates(x) < 50 ∧ VPC_melilite(x) < 10 ∧ VPC_M(x) > 90
    ∧ VPC_kalsilite(x) = 0 ∧ VPC_leucite(x) = 0 ∧ VPC_hornblende(x) = 0
    ∧ 0.4*VPC_OOC(x) < VPC_Ol(x) < 0.9*VPC_OOC(x)
    ∧ VPC_Cpx(x) < 0.05*VPC_OOC(x)

Thus, a precise definition of the harzburgite igneous rock type consists of three 
parts: 

1. Qualitative characteristics (lines 1, 2). 

2. Absolute restrictions on modal data (lines 3, 4). 

3. Relative restrictions on modal data (lines 5, 6). 

Now we can compare this definition with the informal definition quoted in
Section 3: the formal definition is more complete. It leaves nothing implicit
and does not refer to the diagram; instead, it contains the necessary part of
the diagram.
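
The harzburgite formula translates almost literally into an executable predicate. The sketch below assumes a sample is described by a set of qualitative predicate names that hold for it and a dictionary of volume percentages (with a missing entry meaning 0, and VPC_M supplied under the key "M"); these representations are illustrative assumptions, not the authors' data model.

```python
EXCLUDED = ("pyroclastic", "kimberlite", "lamproite", "lamprophyre", "charnockite")


def harzburgite(flags, vpc):
    """Evaluate the harzburgite predicate for one sample.

    flags: set of unary predicate names that are true for the sample.
    vpc:   dict of volume percentages; absent minerals count as 0.
    """
    def v(name):
        return vpc.get(name, 0)

    ooc = v("Ol") + v("Opx") + v("Cpx")  # VPC_OOC
    return (
        # 1. Qualitative characteristics
        "plutonic" in flags
        and not any(p in flags for p in EXCLUDED)
        # 2. Absolute restrictions on modal data
        and v("carbonates") < 50 and v("melilite") < 10 and v("M") > 90
        and v("kalsilite") == 0 and v("leucite") == 0 and v("hornblende") == 0
        # 3. Relative restrictions on modal data
        and 0.4 * ooc < v("Ol") < 0.9 * ooc
        and v("Cpx") < 0.05 * ooc
    )


# Usage with made-up modal data.
print(harzburgite({"plutonic"}, {"Ol": 75, "Opx": 23, "Cpx": 2, "M": 100}))
```

The three groups of conjuncts in the code correspond to the three parts of the definition listed above, which makes it easy to see which kind of restriction a rejected sample violates.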

5 Lessons Learnt, What is Next? 

This paper describes our experience of converting the petrological information 
stored in databases, glossaries, and classification standards to a formal OWL- 
based representation. A similar approach, i.e. one based on providing unambigu- 
ous and consistent definitions for all terms, can be used in developing a formal 
theory for virtually any scientific area. We will now briefly summarize the results
and outline plans for the future. 

From data to knowledge. Moving from a database of petrological facts to a
knowledge base is beneficial from multiple perspectives. Firstly, the new
representation is richer and enables generation of sentences in a controlled
natural language, which, in our experience, are understandable to geologists.
They can be used not only as an interface to the KB, but also to annotate
publications, which should lead to increased amounts of machine-processable
metadata. Secondly, the KB (equipped with a CNL-based interface and a SPARQL
endpoint) can be integrated with the ontology that provides the vocabulary.
This is important for ensuring a consistent use of the terminology across all
information systems using the KB. The stored knowledge can be further
integrated with other available datasets, e.g. those provided by the
EarthChem consortium.^6

Centralized vocabulary. Providing a controlled vocabulary is essential for
managing the knowledge. In our case, it was most important to collect the
terms used in the database in a single OWL ontology, and give them
unambiguous definitions along with human-readable annotations. This is a
substantial improvement compared to the previous situation, where terms were
defined informally and in multiple, often contradictory sources. The
resulting system can be used both as a dictionary (for people and for
applications, e.g., via SPARQL) and as a tool for collaborative work on
terminology.

Rock classification. The formal definitions of the terms captured in standard 
OWL are not detailed enough to support automated rock sample classification, 
which is one of the most important use cases in petrology. To this end, we have 
investigated the possibility of complementing the definitions with quantitative 
restrictions on their mineral composition. Such restrictions can be defined using 
linear equations, a possible extension to the current data ranges in OWL 2. 

Like databases and glossaries, the classification recommendations, namely
[10], are sometimes ambiguous and incomplete as well, so their formalization
requires collaboration with petrologists from the Subcommission on the
Systematics of Igneous Rocks of the International Union of Geological
Sciences. Nevertheless, we managed to identify some predicates and functions
requiring definitions, which can be used as building blocks of a formal
theory. Following the methodology described in Section 4, we have obtained
detailed definitions for two types of rock as well as for some auxiliary
terms. We plan to extend this work to cover all rock types in the
classification.

Our work enables answering questions like "Is the current object a sample of
a certain rock?" by performing instance checking, a standard reasoning task
in OWL. This can be extended to query answering, to find all possible rock
types for a specific sample or to find all samples of a specific type in the
KB. Query answering, however, requires reasoning with linear inequalities,
which is not supported at large scale at the moment (although some reasoners,
e.g. RACER, are available).

Finally, we would like to stress that our approach to formalization differs
from what can be seen in many biological and chemical ontologies. They are
often deep class hierarchies with numerous asserted subsumptions between
class names and with relatively few definitions. We focus on providing
detailed definitions (using standard OWL and linear equations) instead, and
plan to rely on automated reasoners to build and maintain the hierarchy. This
may enable use of the ontologies in a broader range of situations, as
illustrated by rock sample classification.

^6 EarthChem is a community-driven effort to facilitate the preservation,
discovery and visualization of and access to the broadest and richest
geochemical datasets possible: http://www.earthchem.org